# Long Video Understanding

**Qwen2.5-VL-7B-Instruct-GGUF** · unsloth · Apache-2.0 · Image-to-Text · English
Qwen2.5-VL is the latest vision-language model in the Qwen family, offering strong visual understanding and multimodal processing, with support for image and video analysis and structured output; a minimal inference sketch follows.
8,427 downloads · 4 likes

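A minimal sketch of single-image inference with the Qwen2.5-VL family via transformers, shown with the base Qwen/Qwen2.5-VL-7B-Instruct checkpoint (the GGUF quantization above is normally served through llama.cpp-style runtimes instead); the image path is a placeholder.

```python
# Minimal sketch: single-image chat with Qwen2.5-VL via transformers.
# Assumes a recent transformers release with Qwen2.5-VL support;
# "frame.jpg" is a placeholder input.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image = Image.open("frame.jpg")  # placeholder image
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```
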
**docscopeOCR-7B-050425-exp** · prithivMLmods · Apache-2.0 · Image-to-Text · Transformers · Multilingual
docscopeOCR-7B-050425-exp is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct, focusing on document-level OCR, long-context vision-language understanding, and accurate image-to-text conversion of mathematical LaTeX.
531 downloads · 2 likes

**Vamba-Qwen2-VL-7B** · TIGER-Lab · MIT · Video-to-Text · Transformers
Vamba is a hybrid Mamba-Transformer architecture that achieves efficient long-video understanding by combining cross-attention layers with Mamba-2 modules (sketched below).
806 downloads · 16 likes

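The description amounts to an architecture claim, so here is a toy sketch of the hybrid idea, not Vamba's implementation: a GRU stands in for the Mamba-2 module purely for illustration, video tokens are mixed in linear time, and text tokens reach them through cross-attention instead of full self-attention over the concatenated sequence. All dimensions and names are made up.

```python
# Toy hybrid block (illustration only, not Vamba's code): video tokens pass
# through a cheap recurrent mixer standing in for Mamba-2, while text tokens
# use self-attention among themselves plus cross-attention into the video
# stream, so the long video sequence never enters quadratic self-attention.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.video_mixer = nn.GRU(dim, dim, batch_first=True)  # stand-in for Mamba-2
        self.text_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        video, _ = self.video_mixer(video)            # linear-time scan over video tokens
        video = self.norm_v(video)
        t, _ = self.text_self_attn(text, text, text)  # text attends within itself
        t = self.norm_t(text + t)
        c, _ = self.cross_attn(t, video, video)       # text queries the video stream
        return t + c, video

block = HybridBlock()
text = torch.randn(1, 32, 256)     # 32 text tokens
video = torch.randn(1, 4096, 256)  # thousands of video tokens stay out of self-attention
out_text, out_video = block(text, video)
print(out_text.shape, out_video.shape)
```
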
**VideoChat-Flash-Qwen2.5-7B-1M (res224)** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
VideoChat-Flash is a multimodal model built on UMT-L and Qwen2.5-7B-1M, supporting long-video understanding with a context window extended to 1M tokens.
64 downloads · 1 like

**Qwen2.5-VL-3B-Instruct-4bit** · jarvisvasu · Image-to-Text · Transformers · English
Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring enhanced visual understanding, agent capabilities, and long-video processing.
174 downloads · 3 likes

**InternVL2.5 HiCo R64** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
A video multimodal large language model enhanced by Long and Rich Context (LRC) modeling, which improves on existing MLLMs by sharpening the perception of fine-grained details and capturing long-term temporal structure.
252 downloads · 2 likes

**InternVideo2.5-Chat-8B** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
InternVideo2.5 is a video multimodal large language model enhanced by Long and Rich Context (LRC) modeling, built on InternVL2.5. It significantly improves on existing MLLMs by strengthening the perception of fine-grained details and the capture of long-term temporal structure.
8,265 downloads · 60 likes

**LLaVA-Video-7B-Qwen2-TPO** · ruili0 · MIT · Video-to-Text · Transformers
LLaVA-Video-7B-Qwen2-TPO is a video understanding model based on LLaVA-Video-7B-Qwen2 with temporal preference optimization, delivering strong results across multiple benchmarks.
490 downloads · 1 like

**LongVA-7B-TPO** · ruili0 · MIT · Video-to-Text · Transformers
LongVA-7B-TPO is a video-text model derived from LongVA-7B through temporal preference optimization, excelling at long-video understanding tasks.
225 downloads · 1 like

**VideoChat-Flash-Qwen2-7B (res224)** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
A multimodal model built on UMT-L and Qwen2-7B that supports long-video understanding with only 16 tokens per frame and a context window extended to 128k; the arithmetic below shows what that budget buys.
80 downloads · 6 likes

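Those two figures imply a concrete frame budget. A back-of-the-envelope check, where the 1 fps sampling rate and the 1k-token text reserve are assumptions, not stated specs:

```python
# Frame budget implied by the figures above: 128k-token context at
# 16 tokens per frame, reserving ~1k tokens for the text side.
context, per_frame, text_budget = 128_000, 16, 1_000
frames = (context - text_budget) // per_frame
print(frames)           # 7937 frames fit in context
print(frames / 3600)    # ~2.2 hours of video at an assumed 1 fps
```
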
**Apollo-LMMs-Apollo-7B-t32** · GoodiesHere · Apache-2.0 · Video-to-Text · Transformers · English
Apollo is a family of large multimodal models focused on video understanding, able to process videos up to an hour long and supporting complex video QA and multi-turn dialogue.
67 downloads · 55 likes

**Apollo-LMMs-Apollo-1.5B-t32** · GoodiesHere · Apache-2.0 · Video-to-Text
Apollo is a family of large multimodal models focused on video understanding, excelling at long-video comprehension, temporal reasoning, and complex video question answering.
37 downloads · 10 likes

**LongVU_Llama3_2_1B** · Vision-CAIR · Apache-2.0 · Video-to-Text · PyTorch
LongVU applies spatio-temporal adaptive compression for long video-language understanding, processing long videos efficiently without sacrificing language comprehension; the sketch below illustrates the compression intuition.
465 downloads · 11 likes

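LongVU's actual pipeline is more involved (it leans on learned visual features and cross-modal queries), but the core intuition of temporal compression can be sketched as pruning near-duplicate frames by feature similarity. Everything below, from the threshold to the simulated shot features, is a placeholder:

```python
# Conceptual sketch of temporal compression: drop frames whose features are
# nearly identical to the last kept frame. LongVU's real method is more
# sophisticated; threshold and features here are illustrative placeholders.
import torch
import torch.nn.functional as F

def prune_redundant_frames(feats: torch.Tensor, threshold: float = 0.95) -> list[int]:
    """feats: (num_frames, dim) per-frame embeddings. Returns kept frame indices."""
    kept = [0]
    for i in range(1, feats.shape[0]):
        sim = F.cosine_similarity(feats[i], feats[kept[-1]], dim=0)
        if sim < threshold:  # keep only frames that add new content
            kept.append(i)
    return kept

# Simulate 60 "shots" of 10 near-identical frames each (600 frames total).
shots = torch.randn(60, 384).repeat_interleave(10, dim=0)
feats = shots + 0.01 * torch.randn(600, 384)
print(len(prune_redundant_frames(feats)), "of 600 frames kept")  # roughly 60
```
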
**Oryx-1.5-7B** · THUdyh · Apache-2.0 · Video-to-Text · Multilingual
Oryx-1.5-7B is a 7B-parameter model built on the Qwen2.5 language model, supporting a 32K-token context window and specializing in efficient processing of visual inputs at arbitrary spatial resolutions and temporal lengths.
133 downloads · 7 likes

**LongVU_Llama3_2_3B** · Vision-CAIR · Apache-2.0 · Video-to-Text · PyTorch
LongVU applies spatio-temporal adaptive compression for long video-language understanding, designed to process long video content efficiently.
1,079 downloads · 7 likes

**LongVU_Qwen2_7B** · Vision-CAIR · Apache-2.0 · Video-to-Text
LongVU variant based on Qwen2-7B, focused on long video-language understanding via spatio-temporal adaptive compression.
230 downloads · 69 likes

**LLaVA-Video-7B-Qwen2** · lmms-lab · Apache-2.0 · Video-to-Text · Transformers · English
LLaVA-Video is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding and accepting up to 64 frames of video input (see the sampling sketch below).
34.28k downloads · 91 likes

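A common way to produce a fixed 64-frame input is uniform temporal sampling. A minimal OpenCV sketch, assuming a local video file; LLaVA-Video's own preprocessing may differ:

```python
# Minimal sketch: uniformly sample 64 frames from a video with OpenCV.
# "video.mp4" is a placeholder path.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 64) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # (num_frames, H, W, 3), RGB

clip = sample_frames("video.mp4")
print(clip.shape)
```
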
**Kangaroo** · KangarooGroup · Apache-2.0 · Video-to-Text · Transformers · Multilingual
Kangaroo is a powerful multimodal large language model designed specifically for long-video understanding, supporting bilingual (Chinese-English) dialogue and long video inputs.
163 downloads · 12 likes

**timesformer-large-finetuned-k400** · fcakyon · Video Processing · Transformers
TimeSformer is a video classification model built on a spatio-temporal attention mechanism, designed for video understanding tasks; a loading sketch follows.
254 downloads · 0 likes

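A hedged inference sketch using the TimeSformer classes in transformers, mirroring the Hugging Face documentation example with the base 8-frame Kinetics-400 checkpoint; swapping in fcakyon/timesformer-large-finetuned-k400 should load the same way, though the large variant may expect a different frame count.

```python
# Kinetics-400 classification sketch with TimeSformer via transformers,
# following the HF docs pattern. Dummy frames stand in for a real clip.
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

ckpt = "facebook/timesformer-base-finetuned-k400"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TimesformerForVideoClassification.from_pretrained(ckpt)

video = list(np.random.randn(8, 3, 224, 224))  # 8 dummy frames, channels-first
inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
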